Challenges for week 5

Now that we've seen how to run statistical tests and create supervised machine learning models in Python, it's time for you to apply this knowledge. This week has three challenges. Make sure to give them a try and complete all of them.

Some important notes for the challenges:

  1. These challenges are a warm-up and help you get ready for class. Make sure to give them a try. If you get an error message, try to troubleshoot it (using Google often helps). If all else fails, go to the next challenge (but make sure to hand it in).
  2. While we of course like when you get all the answers right, the important thing is to exercise and apply the knowledge. So we will still accept challenges that may not be complete, as long as we see enough effort for each challenge. This means that if one of the challenges is not delivered (not started and no attempt shown), we unfortunately will not be able to provide a full grade for that week.
  3. Delivering the challenge on time via the Canvas assignment is critical, as it also helps you prepare for the DA live session. Check on Canvas how to hand it in.

Facing issues?

We are constantly monitoring the issues on the GitHub general repository (https://github.com/uva-cw-digitalanalytics/2021s2/issues) to help you out. Don't hesitate to log an issue there, explaining well what the problem is, showing the code you are using, and the error message you may be receiving.

Important: We only monitor the repository on weekdays, from 9.30 to 17.00. Issues logged after this time will most likely be answered the next day. This means you should not wait for our response before submitting a challenge :-)

Getting set up for the challenges

We will use the Google Store data that we also saw in the video tutorials. Make sure you have it, either by cloning the general repository or by downloading it from surfdrive (see link in the General Repository homepage) and placing it in the same folder from which you are running this weekly challenge.



The case

Our website has launched new campaigns to increase sales (as binary, converted from order_euros) and revenue (order_euros).

We are interested in two campaigns:

We want to know (a) whether each campaign led to an increase in sales compared to the other campaigns (i.e., any traffic source that is not set as CPC or referral) and (b) whether one campaign led to more sales than the other.

Both dependent variables (sales and revenue) should come from the order_euros variable.

We also want to understand how the device that someone has, and the location that someone is in, influence sales and revenue. This means you also need to create two additional independent variables:

Important note:

Because the dataset is very large and it may take some time to run the code, we will select a random sample of 10% of the visits that are in the dataset. Please run the code below (exactly as it is):


Challenge 1

Create a research question (RQ) and hypothesis (or hypotheses) based on the case description above, and prepare the dataset and the variables needed to answer the RQ and hypothesis.

When everything is done:

Tip:

For sales (i.e., whether someone made a purchase or not), you will need to transform a continuous variable (order_euros) into a binary variable (0 = no purchase, 1 = purchase).
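The tip above can be sketched as follows. This is a minimal illustration on made-up toy values, not the real dataset; only the column name order_euros comes from the case, and the name `sales` for the new column is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real visits sample (values are made up).
visits = pd.DataFrame({"order_euros": [np.nan, 12.0, 0.0, 250.0, np.nan]})

# A visit counts as a sale when order_euros is present and above zero.
# A comparison with NaN evaluates to False, so visits without an order become 0.
visits["sales"] = (visits["order_euros"] > 0).astype(int)

print(visits)
```

Comparing first and casting to int afterwards keeps the original order_euros column untouched, so it can still be used for the revenue variable.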


Data inspection

Before creating RQs and hypotheses, I want to inspect the dataset first.

As I want to explore what influences sales and revenue, I pick the categories that can relate to users' purchasing power. I want to explore how users of Apple devices and users from the United States influence sales and revenue, as these two groups have relatively higher purchasing power compared to other categories: Apple devices are in general more expensive than other devices, and the United States is the country with the highest GDP in the world.

Based on the case description and the data inspection, I form the following research questions and hypotheses:
RQ1: To what extent have the new campaigns (CPC and referral) increased revenue and sales compared to other traffic sources to the website?

Hypotheses for sales:

Hypotheses for revenue:

RQ2: To what extent do the CPC and referral campaigns differ in the total revenues they bring?
RQ3: To what extent do the CPC and referral campaigns differ in the sales (the purchase behavior) among users?

RQ4: To what extent does the ownership of Apple devices have an influence on revenue and sales?
RQ5: To what extent does a location in the United States have an influence on revenue and sales?

The needed IVs for this exercise are:

The needed DVs for this exercise are:

A few variables for the controls: Apple devices and United States

Data cleaning

From the result I see that the order_euros column only contains data for users who made a purchase, because the minimum value in this column is 1. Given that this column also has 2390 missing values, the missing values most likely belong to users who did not make a purchase, and should be filled with 0.
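A minimal sketch of this check and fill, on made-up toy values. The name `revenue` for the filled column is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the sampled visits (values are made up).
visits = pd.DataFrame({"order_euros": [np.nan, 12.0, np.nan, 250.0, 1.0]})

# Inspect the column: how many values are missing, and what is the smallest order?
n_missing = visits["order_euros"].isna().sum()
min_value = visits["order_euros"].min()  # min() skips NaN by default

# Treat a missing order value as "no purchase", i.e. zero revenue.
visits["revenue"] = visits["order_euros"].fillna(0)

print(n_missing, min_value)
print(visits)
```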

Then I can create the needed variables according to the list above.
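A sketch of how such dummy variables could be built. The source column names `medium`, `device_brand` and `country`, and the category labels, are hypothetical; the real dataset may name and code them differently.

```python
import pandas as pd

# Toy stand-in; column names and labels are assumptions, not the real schema.
visits = pd.DataFrame({
    "medium": ["cpc", "referral", "organic", "cpc"],
    "device_brand": ["Apple", "Samsung", "Apple", None],
    "country": ["United States", "Netherlands", "United States", "India"],
})

# One 0/1 dummy per campaign; visits with both at 0 belong to "other" traffic.
visits["cpc"] = (visits["medium"] == "cpc").astype(int)
visits["referral"] = (visits["medium"] == "referral").astype(int)

# Dummies for device and location; a missing brand compares as False, so it becomes 0.
visits["apple"] = (visits["device_brand"] == "Apple").astype(int)
visits["us"] = (visits["country"] == "United States").astype(int)

print(visits[["cpc", "referral", "apple", "us"]])
```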

Descriptive statistics and univariate visualization for IVs

In this dataset, 545 visits come from the referral campaign, 1326 from the CPC campaign, and 3360 from other campaigns. The other campaigns far outnumber the CPC and referral campaigns (the number for the referral campaign is especially small), which may mean that this dataset (or the sample I drew from it) is not balanced.

The right thing to do here would be to draw another sample of the visits dataset with a more balanced number of visits across the three types of campaigns, or to include more cases in the sample, but I will go on because the random state is the same for everyone.
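To make the imbalance concrete, the counts reported above can be turned into shares. The sketch below rebuilds a campaign column from those three counts only; the column name `campaign` and its labels are assumptions.

```python
import pandas as pd

# Rebuild a campaign column from the three counts reported in the text.
campaign = pd.Series(
    ["referral"] * 545 + ["cpc"] * 1326 + ["other"] * 3360, name="campaign"
)

counts = campaign.value_counts()                          # absolute frequencies
shares = campaign.value_counts(normalize=True).round(3)   # relative frequencies

print(counts)
print(shares)
```

On these counts, roughly 64% of visits come from other campaigns, about 25% from CPC, and about 10% from referral.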

Descriptive statistics and univariate visualization for DVs

From the result I find that more users bought something on the website than not: 2390 users did not purchase and 2841 users did. The largest revenue generated was €999.

Visualizations for hypotheses

I have five pairs of hypotheses:

H1a. Users entering the website via the CPC campaign will be more likely to make a purchase compared to users from other traffic sources.
H1b. Users entering the website via the referral campaign will be more likely to make a purchase compared to users from other traffic sources.

H2a. Users entering the website via the CPC campaign will have more expensive orders compared to users entering from other traffic sources.
H2b. Users entering the website via the referral campaign will have more expensive orders compared to users entering from other traffic sources.

H3a. Users entering the website via the CPC campaign will be more likely to make a purchase compared to users entering the website via the referral campaign.
H3b. Users entering the website via the CPC campaign will have more expensive orders compared to users entering the website via the referral campaign.

H4a. Users who have Apple devices will be more likely to make a purchase compared to users who have other kinds of devices.
H4b. Users who have Apple devices will have more expensive orders compared to users who have other kinds of devices.

H5a. Users from United States will be more likely to make a purchase compared to users from other countries.
H5b. Users from the United States will have more expensive orders compared to users from other countries.

From the visualization I find that users entering the website via the CPC campaign were more likely to make a purchase compared to users from other traffic sources. The difference is statistically significant according to the confidence interval. From the visualization, H1a is confirmed by the past data.

Users entering the website via the referral campaign were less likely to make a purchase compared to users from other traffic sources, but the difference is not statistically significant. From the visualization, H1b is not supported by the past data.

Users entering the website via the CPC campaign had more expensive orders compared to users entering from other traffic sources. The difference is statistically significant. From the visualization, H2a is confirmed by the past data.

Although they were less likely to make a purchase, users entering the website via the referral campaign had more expensive orders compared to users entering from other traffic sources. The difference is statistically significant. From the visualization, H2b is confirmed by the past data.

Users entering the website via the CPC campaign were more likely to make a purchase compared to users entering the website via the referral campaign. The difference is statistically significant. From the visualization, H3a is confirmed by the past data.

Users entering the website via the CPC campaign had more expensive orders compared to users entering the website via the referral campaign, but the difference is not statistically significant. From the visualization, H3b is not supported by the past data.

Users who have Apple devices were less likely to make a purchase compared to users who have other kinds of devices. The difference is not statistically significant. From the visualization, H4a is not supported by the past data.

Users who have Apple devices had less expensive orders compared to users who have other kinds of devices. The difference is statistically significant. From the visualization, H4b is rejected by the past data.

Users from United States were more likely to make a purchase compared to users from other countries. The difference is not statistically significant. From the visualization, H5a is not supported by the past data.

Users from the United States had more expensive orders compared to users from other countries. The difference is statistically significant. From the visualization, H5b is confirmed by the past data.
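The confidence-interval reading used in the interpretations above can also be checked numerically. The sketch below uses made-up toy counts and a normal-approximation 95% interval, which is an assumption: a plotting library may compute its intervals differently (for example by bootstrapping).

```python
import numpy as np
import pandas as pd

# Toy data: 70 of 100 CPC visits and 50 of 100 other visits end in a sale.
visits = pd.DataFrame({
    "campaign": ["cpc"] * 100 + ["other"] * 100,
    "sales": [1] * 70 + [0] * 30 + [1] * 50 + [0] * 50,
})

# Purchase rate and group size per campaign.
stats = visits.groupby("campaign")["sales"].agg(["mean", "count"])

# Normal-approximation standard error and 95% interval for a proportion.
stats["se"] = np.sqrt(stats["mean"] * (1 - stats["mean"]) / stats["count"])
stats["ci_low"] = stats["mean"] - 1.96 * stats["se"]
stats["ci_high"] = stats["mean"] + 1.96 * stats["se"]

print(stats.round(3))
```

Intervals that do not overlap point to a significant difference. Overlapping intervals are less conclusive, so the regression models remain the actual test.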

Descriptives of the DV grouped by the IV
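A minimal sketch of such a grouped summary, on made-up toy values. The column names `campaign`, `sales` and `revenue` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the prepared dataset (values are made up).
visits = pd.DataFrame({
    "campaign": ["cpc", "cpc", "referral", "referral", "other", "other"],
    "sales": [1, 1, 0, 1, 0, 1],
    "revenue": [400.0, 500.0, 0.0, 300.0, 0.0, 100.0],
})

# The mean of the binary sales variable is the purchase rate per group;
# the mean of revenue is the average order value per visit.
summary = visits.groupby("campaign")[["sales", "revenue"]].mean()

print(summary)
```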

From the result of the groupby chart, I find that:


Challenge 2

For this challenge, we would like you to focus on sales (binary variable) as the DV.

You need to test the hypotheses and make predictions for each campaign using ML. In other words, you need to:

  1. Create statistical models that (a) test whether the campaigns lead to a higher likelihood of a sale than the other campaigns, and (b) test whether the referral campaign leads to a higher likelihood of a sale than the CPC campaign
  2. Use ML to create similar models (as in 1), and use them to run predictions (e.g., what is the likelihood of a sale if someone came to the website via the referral campaign? Or from the CPC campaign? Or from other campaigns?)
  3. Use LIME to explain the predictions created by the model, contrasting the importance of campaigns with the importance of device type and of location.

Don't forget to interpret the results in Markdown, and indicate whether your hypotheses were supported, not supported (or even rejected).


Regression

Because sales is a binary variable, for this exercise, I use logistic regression.
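A minimal sketch of such a model, assuming statsmodels is used. The data are made-up toy counts (70/100 sales for CPC, 49/100 for referral, 50/100 for other), and the column names are assumptions. The device and location dummies would be added to the formula in the same way.

```python
import pandas as pd
import statsmodels.formula.api as smf


def make_group(cpc, referral, n_buy, n_total):
    """Build n_total toy visits for one traffic source, n_buy of them with a sale."""
    return pd.DataFrame({
        "cpc": cpc,
        "referral": referral,
        "sales": [1] * n_buy + [0] * (n_total - n_buy),
    })


visits = pd.concat(
    [
        make_group(1, 0, 70, 100),  # CPC campaign
        make_group(0, 1, 49, 100),  # referral campaign
        make_group(0, 0, 50, 100),  # other traffic (reference group)
    ],
    ignore_index=True,
)

# Both dummies are 0 for other traffic, so each coefficient compares
# one campaign against the other traffic sources.
logit_model = smf.logit("sales ~ cpc + referral", data=visits).fit(disp=False)

print(logit_model.params)   # b coefficients (log-odds)
print(logit_model.pvalues)  # p-values per coefficient
```

To compare CPC directly with referral (H3a), one option is to refit the model with referral as the reference group, for example by replacing the `referral` dummy with an `other` dummy.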

My hypotheses related to sales, to be tested by the model, are:

H1a: The b coefficient for the CPC campaign is positive, indicating a positive effect: users entering the website via the CPC campaign are, on average, more likely to make a purchase compared to users from other traffic sources (p < .001).

H1b: The referral campaign has a negative effect on sales. Users entering the website via the referral campaign are less likely to purchase something compared to users from other traffic sources. However, the p value (p = 0.87) is larger than 0.05, indicating that the negative effect is not statistically significant.

H4a: Having an Apple device negatively predicts purchase behavior, but the difference between users of Apple devices and other devices is not significant (p > .05).

H5a: The partial effect of a US location is positive but not statistically significant (p > .05).

H3a: The b coefficient for the CPC campaign is positive, indicating a positive effect: users entering the website via the CPC campaign are, on average, more likely to make a purchase compared to users entering the website via the referral campaign (p < .001).

Hypothesis Result
H1a confirmed
H1b not supported
H3a confirmed
H4a not supported
H5a not supported

ML

For someone who comes to the website via the CPC campaign, does not use an Apple device, and is not from the United States, the model predicts a 31.8% probability of no purchase and a 68.2% probability of a purchase.

For someone who comes to the website via the CPC campaign, uses an Apple device, and is not from the United States, the model predicts a 32.1% probability of no purchase and a 67.9% probability of a purchase.

For someone who comes to the website via the CPC campaign, uses an Apple device, and is from the United States, the model predicts a 31.5% probability of no purchase and a 68.5% probability of a purchase.

For someone who comes to the website via the CPC campaign, does not use an Apple device, and is from the United States, the model predicts a 31.2% probability of no purchase and a 68.8% probability of a purchase.

For someone who comes to the website via the referral campaign, does not use an Apple device, and is not from the United States, the model predicts a 51% probability of no purchase and a 49% probability of a purchase.

For someone who comes to the website via the referral campaign, uses an Apple device, and is not from the United States, the model predicts a 51.3% probability of no purchase and a 48.7% probability of a purchase.

For someone who comes to the website via the referral campaign, uses an Apple device, and is from the United States, the model predicts a 50.6% probability of no purchase and a 49.4% probability of a purchase.

For someone who comes to the website via the referral campaign, does not use an Apple device, and is from the United States, the model predicts a 50.2% probability of no purchase and a 49.8% probability of a purchase.

For someone who comes to the website via other campaigns, does not use an Apple device, and is not from the United States, the model predicts a 50.7% probability of no purchase and a 49.3% probability of a purchase.

For someone who comes to the website via other campaigns, uses an Apple device, and is not from the United States, the model predicts a 51% probability of no purchase and a 49% probability of a purchase.

For someone who comes to the website via other campaigns, uses an Apple device, and is from the United States, the model predicts a 50.2% probability of no purchase and a 49.8% probability of a purchase.

For someone who comes to the website via other campaigns, does not use an Apple device, and is from the United States, the model predicts a 50% probability of no purchase and a 50% probability of a purchase.
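Probabilities like the ones above can be produced with a classifier's predict_proba method. The sketch below assumes scikit-learn's LogisticRegression and uses made-up toy counts, not the real sample. It includes only the two campaign dummies; the device and location dummies would be extra feature columns. The large C value is an assumption that effectively switches off the default regularization, so the fitted probabilities stay close to the observed purchase rates.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

features = ["cpc", "referral"]

# Toy data built from made-up counts: (cpc, referral, purchases, visits).
rows = []
for cpc, referral, n_buy, n_total in [(1, 0, 70, 100), (0, 1, 49, 100), (0, 0, 50, 100)]:
    rows += [{"cpc": cpc, "referral": referral, "sales": 1}] * n_buy
    rows += [{"cpc": cpc, "referral": referral, "sales": 0}] * (n_total - n_buy)
visits = pd.DataFrame(rows)

clf = LogisticRegression(C=1e6, max_iter=1000)
clf.fit(visits[features], visits["sales"])

# One hypothetical visitor profile per traffic source.
profiles = pd.DataFrame(
    {"cpc": [1, 0, 0], "referral": [0, 1, 0]},
    index=["cpc", "referral", "other"],
)

# predict_proba returns one column per class, ordered as in clf.classes_ (0, then 1).
proba = pd.DataFrame(
    clf.predict_proba(profiles[features]),
    index=profiles.index,
    columns=["p_no_purchase", "p_purchase"],
)

print(proba.round(3))
```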

LIME

For someone who comes to the website via the CPC campaign, does not use an Apple device, and is not from the United States, the model predicts a 32% probability of no purchase and a 68% probability of a purchase. The CPC campaign makes a purchase more likely, while the location makes it less likely (this is not a large effect). In this case, the use of an Apple device does not influence the purchase prediction.

For someone who comes to the website via the CPC campaign, uses an Apple device, and is not from the United States, the model predicts a 32% probability of no purchase and a 68% probability of a purchase. The CPC campaign makes a purchase more likely, while the location makes it less likely (this is not a large effect). In this case, the use of an Apple device does not influence the purchase prediction.

For someone who comes to the website via the CPC campaign, uses an Apple device, and is from the United States, the model predicts a 31% probability of no purchase and a 69% probability of a purchase. Both the CPC campaign and the US location make a purchase more likely, although the effect of location is not large. In this case, the use of an Apple device does not influence the purchase prediction.

For someone who comes to the website via the CPC campaign, does not use an Apple device, and is from the United States, the model predicts a 31% probability of no purchase and a 69% probability of a purchase. Both the CPC campaign and the US location make a purchase more likely, although the effect of location is not large. In this case, the use of an Apple device does not influence the purchase prediction.

For someone who comes to the website via the referral campaign, does not use an Apple device, and is not from the United States, the model predicts a 51% probability of no purchase and a 49% probability of a purchase. Both the campaign and the location make a purchase less likely, although the effect of location is not large. In this case, the use of an Apple device does not influence the purchase prediction.

For someone who comes to the website via the referral campaign, uses an Apple device, and is not from the United States, the model predicts a 51% probability of no purchase and a 49% probability of a purchase. Both the campaign and the location make a purchase less likely, although the effect of location is not large. In this case, the use of an Apple device does not influence the purchase prediction.

For someone who comes to the website via the referral campaign, uses an Apple device, and is from the United States, the model predicts a 51% probability of no purchase and a 49% probability of a purchase. The campaign makes a purchase less likely, while the location makes it more likely (this is not a large effect). In this case, the use of an Apple device does not influence the purchase prediction.

For someone who comes to the website via the referral campaign, does not use an Apple device, and is from the United States, the model predicts a 50% probability of no purchase and a 50% probability of a purchase. The campaign makes a purchase less likely, while the location makes it more likely (this is not a large effect). In this case, the use of an Apple device does not influence the purchase prediction.

For someone who comes to the website via other campaigns, does not use an Apple device, and is not from the United States, the model predicts a 51% probability of no purchase and a 49% probability of a purchase. Both the campaign and the location make a purchase less likely, although the effect of location is not large. In this case, the use of an Apple device does not influence the purchase prediction.

For someone who comes to the website via other campaigns, uses an Apple device, and is not from the United States, the model predicts a 51% probability of no purchase and a 49% probability of a purchase. Both the campaign and the location make a purchase less likely, although the effect of location is not large. In this case, the use of an Apple device does not influence the purchase prediction.

For someone who comes to the website via other campaigns, uses an Apple device, and is from the United States, the model predicts a 50% probability of no purchase and a 50% probability of a purchase. The campaign makes a purchase less likely, while the location makes it more likely (this is not a large effect). In this case, the use of an Apple device does not influence the purchase prediction.

For someone who comes to the website via other campaigns, does not use an Apple device, and is from the United States, the model predicts a 50% probability of no purchase and a 50% probability of a purchase. The campaign makes a purchase less likely, while the location makes it more likely (this is not a large effect). In this case, the use of an Apple device does not influence the purchase prediction.

Through the comparison I find that, compared with the campaign, device type and location make little difference to the predicted purchase behavior.


Challenge 3

For this challenge, we would like you to focus on revenue (continuous variable) as the DV.

You need to test the hypotheses and make predictions for each campaign using ML. In other words, you need to:

  1. Create statistical models that (a) test whether the campaigns lead to a higher revenue than the other campaigns, and (b) test whether the referral campaign leads to a higher revenue than the CPC campaign
  2. Use ML to create similar models (as in 1), and use them to run predictions (e.g., what is the expected revenue if someone came to the website via the referral campaign? Or from the CPC campaign? Or from other campaigns?)
  3. Use LIME to explain the predictions created by the model, contrasting the importance of campaigns with the importance of device type and of location.

Don't forget to interpret the results in Markdown, and indicate whether your hypotheses were supported, not supported (or even rejected).


Regression

Because revenue is a continuous variable, for this exercise, I use linear regression.
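To show what the coefficients of such a model mean, the sketch below fits plain least squares with NumPy on made-up toy values. This is a swapped-in illustration, not the course's method: it yields coefficients only, and a statistics package would be needed for the p-values.

```python
import numpy as np

# Toy revenues: two CPC visits, two referral visits, three other visits.
revenue = np.array([400.0, 500.0, 300.0, 500.0, 100.0, 0.0, 200.0])
cpc = np.array([1, 1, 0, 0, 0, 0, 0])
referral = np.array([0, 0, 1, 1, 0, 0, 0])

# Design matrix: intercept column plus one dummy per campaign.
X = np.column_stack([np.ones_like(revenue), cpc, referral])

# Ordinary least squares: minimise ||X @ coef - revenue||^2.
coef, *_ = np.linalg.lstsq(X, revenue, rcond=None)
intercept, b_cpc, b_referral = coef

print(intercept, b_cpc, b_referral)
```

With dummy-coded campaigns, the intercept is the mean revenue of the reference group (other traffic), and each b coefficient is the difference between a campaign's mean revenue and that reference mean.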

My hypotheses related to revenue, to be tested by the model, are:

H2a: The b coefficient for the CPC campaign is positive, indicating a positive effect: users entering the website via the CPC campaign place, on average, much more expensive orders compared to users from other traffic sources (p < .001).

H2b: The referral campaign also has a positive effect on revenue. Users entering the website via the referral campaign generate more revenue compared to users from other traffic sources (p < .001).

H4b: Having an Apple device negatively predicts revenue, but the difference between users of Apple devices and other devices is not significant (p > .05).

H5b: The partial effect of a US location is positive but not statistically significant (p > .05).

H3b: The b coefficient for the CPC campaign is positive, indicating a positive effect: users entering the website via the CPC campaign place, on average, more expensive orders compared to users entering the website via the referral campaign (p < .001).

Hypothesis Result
H1a confirmed
H1b not supported
H2a confirmed
H2b confirmed
H3a confirmed
H3b confirmed
H4a not supported
H4b not supported
H5a not supported
H5b not supported

ML

The expected revenue for someone who comes to the website via the CPC campaign, does not use an Apple device, and is not from the United States is €442.5.

The expected revenue for someone who comes to the website via the CPC campaign, uses an Apple device, and is not from the United States is €435.4.

The expected revenue for someone who comes to the website via the CPC campaign, uses an Apple device, and is from the United States is €449.6.

The expected revenue for someone who comes to the website via the CPC campaign, does not use an Apple device, and is from the United States is €456.8.

The expected revenue for someone who comes to the website via the referral campaign, does not use an Apple device, and is not from the United States is €395.9.

The expected revenue for someone who comes to the website via the referral campaign, uses an Apple device, and is not from the United States is €388.8.

The expected revenue for someone who comes to the website via the referral campaign, uses an Apple device, and is from the United States is €403.

The expected revenue for someone who comes to the website via the referral campaign, does not use an Apple device, and is from the United States is €410.1.

The expected revenue for someone who comes to the website via other campaigns, does not use an Apple device, and is not from the United States is €95.4.

The expected revenue for someone who comes to the website via other campaigns, uses an Apple device, and is not from the United States is €88.3.

The expected revenue for someone who comes to the website via other campaigns, uses an Apple device, and is from the United States is €102.5.

The expected revenue for someone who comes to the website via other campaigns, does not use an Apple device, and is from the United States is €109.6.
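The twelve predictions above follow an additive pattern, which is what a linear model produces: each prediction is the intercept plus the effects of the dummies that are switched on. The sketch below makes that arithmetic explicit. The coefficient values are back-calculated from the reported predictions (for example, 442.5 minus the baseline of about 95.4 for the CPC effect); they are an assumption and were not read from the model output.

```python
# Effects implied by the reported predictions (back-calculated, not model output).
INTERCEPT = 95.375   # other campaign, no Apple device, not from the US
B_CPC = 347.125
B_REFERRAL = 300.5
B_APPLE = -7.125
B_US = 14.25


def expected_revenue(cpc, referral, apple, us):
    """Linear prediction: intercept plus the effect of every active dummy."""
    return INTERCEPT + B_CPC * cpc + B_REFERRAL * referral + B_APPLE * apple + B_US * us


# A few of the twelve profiles from the text.
print(expected_revenue(cpc=1, referral=0, apple=0, us=0))  # CPC baseline
print(expected_revenue(cpc=0, referral=1, apple=1, us=1))  # referral, Apple, US
print(expected_revenue(cpc=0, referral=0, apple=1, us=0))  # other, Apple
```

Read this way, the campaign dummies move the prediction by hundreds of euros, while the device and location dummies move it by roughly €7 and €14.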

LIME

In this section, I do not know why LIME shows garbled text that makes the predicted value very hard to read. It seems that the numbers to be visualized are too large, so they overlap each other.

The expected revenue for someone who comes to the website via the CPC campaign, does not use an Apple device, and is not from the United States is €442.5.

The expected revenue for someone who comes to the website via the CPC campaign, uses an Apple device, and is not from the United States is €435.38.

The expected revenue for someone who comes to the website via the CPC campaign, uses an Apple device, and is from the United States is €449.63.

The expected revenue for someone who comes to the website via the CPC campaign, does not use an Apple device, and is from the United States is €456.75.

The expected revenue for someone who comes to the website via the referral campaign, does not use an Apple device, and is not from the United States is €395.88.

The expected revenue for someone who comes to the website via the referral campaign, uses an Apple device, and is not from the United States is €388.75.

The expected revenue for someone who comes to the website via the referral campaign, uses an Apple device, and is from the United States is €403.

The expected revenue for someone who comes to the website via the referral campaign, does not use an Apple device, and is from the United States is €410.13.

The expected revenue for someone who comes to the website via other campaigns, does not use an Apple device, and is not from the United States is €95.38.

The expected revenue for someone who comes to the website via other campaigns, uses an Apple device, and is not from the United States is €88.25.

The expected revenue for someone who comes to the website via other campaigns, uses an Apple device, and is from the United States is €102.5.

The expected revenue for someone who comes to the website via other campaigns, does not use an Apple device, and is from the United States is €109.63.


Important exception

As LIME is a framework still in development, we are not sure if it will work on all computers and configurations. If by any chance you get an error message when running LIME, hand in the challenge anyway (showing the error message), and we will accept it as complete.